R PLOTTING - PART 1
In the instructions below:
EXERCISE marks bits of code to execute or practice tasks to complete, often with only hints on how to perform them.
Output indicates the typical output you should expect from a given instruction.
Data
We will work on a simple dataset of cholesterol levels in patients. The data give plasma cholesterol concentrations before the diet and after 4 and 8 weeks of a diet containing one of two types of margarine. The age group of each patient is also indicated.
##   ID Before After4weeks After8weeks Margarine AgeGroup
## 1  1   6.42        5.83        5.75         B    Young
## 2  2   6.76        6.20        6.13         A    Young
## 3  3   6.56        5.83        5.71         B    Young
## 4  4   4.80        4.27        4.15         A    Young
## 5  5   8.43        7.71        7.67         B    Young
## 6  6   7.49        7.12        7.05         A   Middle
##        ID            Before       After4weeks     After8weeks    Margarine
##  Min.   : 1.00   Min.   :3.910   Min.   :3.700   Min.   :3.660   A:9
##  1st Qu.: 5.25   1st Qu.:5.740   1st Qu.:5.175   1st Qu.:5.210   B:9
##  Median : 9.50   Median :6.500   Median :5.830   Median :5.730
##  Mean   : 9.50   Mean   :6.408   Mean   :5.842   Mean   :5.779
##  3rd Qu.:13.75   3rd Qu.:7.218   3rd Qu.:6.730   3rd Qu.:6.688
##  Max.   :18.00   Max.   :8.430   Max.   :7.710   Max.   :7.670
##    AgeGroup
##  Middle:6
##  Old   :7
##  Young :5
Simple R plot
To represent data graphically, we have to assign it to the proper plot axes. The simplest way of plotting data in R is the built-in function plot(). Variables to plot can be supplied as its arguments, as x (horizontal axis) and y (vertical axis), or as a single argument, a so-called formula, which describes the relationship between the dependent and independent variable (in other words, between y and x) as y ~ x. Have a look at ?plot to learn more. Note that, depending on the situation, you may have to provide just the variable names together with the name of the dataset (using data = ...) or refer directly to variables in your dataset (e.g. by using ...$variable_x).
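The two calling styles described above can be sketched as follows. The object name chol is hypothetical, and the data frame holds only the first six rows shown in the data preview, not the full dataset.

```r
# Hypothetical object name `chol`; values copied from the first six rows of the preview
chol <- data.frame(
  Before      = c(6.42, 6.76, 6.56, 4.80, 8.43, 7.49),
  After4weeks = c(5.83, 6.20, 5.83, 4.27, 7.71, 7.12)
)

# Style 1: pass x and y directly, extracting the columns with $
plot(x = chol$Before, y = chol$After4weeks)

# Style 2: pass a formula y ~ x and name the dataset with data =
plot(After4weeks ~ Before, data = chol)
```

Both calls should draw the same points. The formula version also picks up the axis labels from the variable names.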
EXERCISE 1: Try to recreate the plot below using the loaded data.
Output
EXERCISE 2: Modify the plot using options that change the colour and shape of the points (hints can be found here: https://www.r-graph-gallery.com/6-graph-parameters-reminder.html). Set the symbols to filled blue squares.
Output
EXERCISE 3: Colours and shapes on a plot can be changed at will. Experiment to recreate the plot below. Note: you have to create your own data or enter it directly into the plotting function. Information about colour codes can be found here: http://derekogle.com/NCGraphing/resources/colors - in summary, you can choose colours by name (e.g. "hotpink") or by hexadecimal code (e.g. "#AA6574").
Output
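As a rough illustration of the colour and shape options mentioned above (not the target plot itself), the sketch below uses made-up coordinates. The pch codes and colour choices are arbitrary picks for demonstration.

```r
# Made-up coordinates, purely for illustration
x <- 1:5
y <- c(2, 4, 3, 5, 1)

# pch selects the symbol (15 = filled square, 17 = filled triangle, 19 = filled circle);
# col accepts colour names or hexadecimal codes; cex scales the symbol size
plot(x, y,
     pch = c(15, 17, 19, 15, 17),
     col = c("hotpink", "#AA6574", "blue", "darkgreen", "#FF8800"),
     cex = 2)

# A colour name and its hexadecimal code resolve to the same RGB values
hotpink_rgb <- col2rgb("hotpink")
hex_rgb     <- col2rgb("#FF69B4")
```

Vectors passed to pch and col are matched to the points in order, so each point can get its own symbol and colour.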
A histogram is a useful and frequently used type of plot. It can be generated using the hist() function.
EXERCISE 4: Create a histogram of 50 random samples from a normal distribution with mean 20 and standard deviation 4 (you may want to use the call rnorm(50, 20, 4)).
Output
EXERCISE 5: Redo the histogram with a larger number of bins.
Output (example)
EXERCISE 6: Instead of a histogram, distributional data can be presented as a smoothed density (kernel density). You can use the built-in function density() to produce such a curve, and overlay it on an existing plot using the lines() function (like the points() function, which overlays points, lines() does not create a new plot but adds lines to an existing one). Try to recreate the above histogram with an overlaid density line:
Output (example)
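One possible way of combining hist(), density() and lines() is sketched below. The seed, the number of breaks and the line colour are arbitrary assumptions, so your plot will look different.

```r
set.seed(1)                    # arbitrary seed, only to make the sketch reproducible
samples <- rnorm(50, 20, 4)    # 50 draws, mean 20, standard deviation 4

# freq = FALSE puts the y axis on the density scale, so bars and curve are comparable;
# breaks is only a suggestion that R may round to "pretty" values
h <- hist(samples, breaks = 15, freq = FALSE)

# density() computes the kernel density estimate; lines() adds it to the open plot
d <- density(samples)
lines(d, col = "red", lwd = 2)
```

Without freq = FALSE the histogram shows counts, and the density curve would be drawn almost flat along the bottom of the plot.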
Using ggplot2
Control over graphical parameters in the plot() function is rudimentary. The ggplot2 package gives much more control over how plots are built. It is based on the so-called grammar of graphics, a set of rules describing how a plot is constructed:
- the linking of data to specific elements of a plot (so-called mapping) is separated from the actual appearance of those elements (i.e., aesthetics);
- the plot has a layered structure, with later elements drawn on top of earlier ones;
- if possible, all plot elements should be built on the go, inside the plotting code, without the need to modify or transform the original data.
A simple ggplot2 graph may be structured as follows:
# the function is ggplot(); ggplot2 is the name of the package
mygraph <- ggplot(data = MYDATA,
                  mapping = aes(x = VAR1, y = VAR2, ...)) +
  geom_1(OPTIONS) +
  geom_2(OPTIONS)
plot(mygraph)
# further layers can be added to a stored plot object
graph2 <- mygraph + geom_3(OPTIONS)
plot(graph2)
Calling the ggplot() function on its own only creates an object of class ggplot; no data are drawn yet. Such an object contains the data and their mappings to specific elements of the final plot. To display the data, we need an additional function from the geom_... family, which adds specific visual elements to the defined mappings (e.g. geom_point() adds scatter points, geom_histogram() forms a histogram). Subsequent elements are chained using the + operator. Other elements that can be added to the plot using + are display and aesthetic rules, e.g. theme(), which describes the appearance of non-data elements of a plot.
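To make the template more concrete, here is a minimal sketch with real function names in place of the placeholders. The object name chol is hypothetical, the six rows come from the data preview, and the choice of geom_point(), geom_line() and theme_minimal() is arbitrary.

```r
library(ggplot2)

# Hypothetical object name `chol`; values copied from the first six rows of the preview
chol <- data.frame(
  Before      = c(6.42, 6.76, 6.56, 4.80, 8.43, 7.49),
  After8weeks = c(5.75, 6.13, 5.71, 4.15, 7.67, 7.05)
)

# ggplot() stores the data and the mappings; this object has no layers yet
base <- ggplot(data = chol, mapping = aes(x = Before, y = After8weeks))

# Adding a geom gives the plot a visible layer
p <- base + geom_point()
print(p)

# Further layers and a theme are appended with +
p2 <- p + geom_line() + theme_minimal()
print(p2)
```

Because each step is stored in its own object, a plot can be built up gradually and variants can share the same base.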
It is a myth that R is incapable of producing a final, publication quality plot that would not have to be modified afterwards :)
Load the ggplot2 package. If you don't have it, use install.packages() to install it.
EXERCISE 7: Make a scatterplot similar to one from the previous exercises, mapping the cholesterol concentrations to the x and y axes. Use blue squares as points. You may want to use the cex option to increase the default symbol size (cex is a multiplicative coefficient that scales plot elements by the given factor).
Output
EXERCISE 8: Let’s improve the plot by removing the annoying gray background. Add a theme_...() call to the plot (you can review the predefined themes here: https://ggplot2.tidyverse.org/reference/ggtheme.html) to produce a cleaner graph.
Output
EXERCISE 9: An even more aesthetically pleasing plot can be produced using the “classic” theme. Also try adding a theme() definition to the plot to modify the text element with the following formatting: element_text(size = 20), which should increase the default font size.
Output
EXERCISE 10: Add a geom_smooth() layer to the plot, selecting lm as its method option. Do you know what lm stands for?
Output
## `geom_smooth()` using formula 'y ~ x'
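The message above is informational: geom_smooth() reports the formula it chose by default. As a sketch (assuming the hypothetical object name chol and only the six preview rows), the formula can be stated explicitly, which avoids the message.

```r
library(ggplot2)

# Hypothetical object name `chol`; values copied from the first six rows of the preview
chol <- data.frame(
  Before      = c(6.42, 6.76, 6.56, 4.80, 8.43, 7.49),
  After8weeks = c(5.75, 6.13, 5.71, 4.15, 7.67, 7.05)
)

p <- ggplot(chol, aes(x = Before, y = After8weeks)) +
  geom_point() +
  # method and formula are given explicitly, so no message is printed
  geom_smooth(method = "lm", formula = y ~ x)
print(p)
```

Options such as colour or linetype can be passed to geom_smooth() in the same call to change how the fitted line looks.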
EXERCISE 11: Modify the above call to change the appearance of the regression line. Output
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
EXERCISE 12: Try to map the age groups (AgeGroup) to the colours of the points. Using the alpha option (which takes values from 0 to 1 and sets the transparency of the regression error band), I made the error bands a bit more subtle, which reduces the cluttered appearance of the plot.
Output
## `geom_smooth()` using formula 'y ~ x'
EXERCISE 12: Let’s add a labs() layer with more readable axis labels.
Output
## `geom_smooth()` using formula 'y ~ x'
EXERCISE 13: An alternative way of showing age groups, instead of mapping them to colours, is to split the groups into so-called facets, which present subsets of the data in separate subplots using a common scale. To achieve this, use the facet_wrap() function, which takes a formula of the form ~ A, where A is a variable from the dataset that defines the split of the graph area into subplots. (An analogous function, facet_grid(), handles two-sided formulas A ~ B that define a grid of plots.) Try to recreate the plot below. It may look better with the theme_bw() style than with the “classic” one.
Output
## `geom_smooth()` using formula 'y ~ x'
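A minimal sketch of faceting, assuming the hypothetical object name chol and only the six preview rows (so one panel holds a single point):

```r
library(ggplot2)

# Hypothetical object name `chol`; values copied from the first six rows of the preview
chol <- data.frame(
  Before      = c(6.42, 6.76, 6.56, 4.80, 8.43, 7.49),
  After8weeks = c(5.75, 6.13, 5.71, 4.15, 7.67, 7.05),
  Margarine   = c("B", "A", "B", "A", "B", "A"),
  AgeGroup    = c("Young", "Young", "Young", "Young", "Young", "Middle")
)

# One panel per level of AgeGroup; all panels share the same x and y scales
p <- ggplot(chol, aes(x = Before, y = After8weeks)) +
  geom_point() +
  facet_wrap(~ AgeGroup) +
  theme_bw()
print(p)
```

Replacing the facet_wrap() line with facet_grid(Margarine ~ AgeGroup) would arrange the panels in a grid, with margarine types as rows and age groups as columns.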
EXERCISE 14: Using the geom_histogram() geometry, create a histogram of the After8weeks variable.
Output
EXERCISE 15: Change the histogram so that it displays relative frequencies of data in each bin, and not absolute counts. Inspiration on how to do this can be found here: https://homepage.divms.uiowa.edu/~luke/classes/STAT4580/histdens.html - there are at least two ways of achieving this goal!
Output
EXERCISE 16: Modify the histogram to add a kernel density estimator to it (it is an analogue of the density() function we have used earlier).
Output
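One possible approach to putting a histogram and a kernel density curve on the same scale is sketched below, using simulated values rather than the course data. The object name sim, the seed, the number of bins and the colours are all assumptions, and after_stat() requires ggplot2 3.3.0 or later.

```r
library(ggplot2)

set.seed(42)                                              # arbitrary seed
sim <- data.frame(value = rnorm(200, mean = 20, sd = 4))  # simulated values, not course data

p <- ggplot(sim, aes(x = value)) +
  # after_stat(density) rescales the bars so that their total area equals 1
  geom_histogram(aes(y = after_stat(density)), bins = 20,
                 fill = "grey80", colour = "black") +
  # geom_density() draws a kernel density estimate on the same scale
  geom_density(colour = "red")
print(p)
```

The density scale is only one option. Mapping y to after_stat(count / sum(count)) instead would show the proportion of observations in each bin, which is not the same scale as the density curve.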
ADDITIONAL EXERCISES
- geom_boxplot() can be used to visualise a numeric variable split by a categorical one. Try to produce such a plot, showing the cholesterol concentrations before the diet, categorised by age group. Use ?geom_boxplot and, if needed, the book https://ggplot2-book.org to find out how to achieve this. On such a boxplot, what is the meaning of the boundaries of each box, the ends of the whiskers, and the additional points added to the plot?
- A boxplot may be much more informative if we add the raw data to it. This can be done in many ways, e.g. to achieve an effect similar to this one: https://bit.ly/31estrN. Try to produce a similar plot using additional data (file Diet_R.csv, which presents the weight loss of patients on three different diets). Before using the data, remove all missing values (na.omit()).
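As a starting point, the sketch below shows na.omit() and a boxplot with raw points on a made-up toy data frame. The column names diet and loss are hypothetical and the values are invented; the real Diet_R.csv may be organised differently.

```r
library(ggplot2)

# Made-up toy data; column names and values are hypothetical, not taken from Diet_R.csv
toy <- data.frame(
  diet = factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C")),
  loss = c(2.1, 3.4, NA, 1.0, 2.2, 2.9, 4.8, NA, 5.5)
)

# na.omit() drops every row that contains at least one missing value
toy_clean <- na.omit(toy)

p <- ggplot(toy_clean, aes(x = diet, y = loss)) +
  geom_boxplot(outlier.shape = NA) +      # hide outlier symbols; raw points are drawn below
  geom_jitter(width = 0.1, height = 0)    # raw observations with a small horizontal jitter
print(p)
```

Setting height = 0 keeps the jitter horizontal only, so the plotted y values remain the measured ones.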